ABSTRACT


With the popularity of social media, there has been increasing interest in user profiling and its
applications. This paper presents our system, UIR-SIST, built for the User Profiling Technology
Evaluation Campaign of SMP CUP 2017. UIR-SIST addresses three tasks: keyword
extraction from blogs, user interest labeling and user growth value prediction. To this end, we first extract
keywords from a user’s blog, drawing on the blog itself, blogs on the same topic and other blogs published by
the same user. Then a unified neural network model based on a convolutional neural network
(CNN) is constructed for user interest tagging. Finally, we adopt a stacking model to predict user growth value.
Our system ranked sixth, with evaluation scores of 0.563, 0.378 and 0.751 on the three tasks,
respectively.


1. INTRODUCTION 


Social media have recently become an important platform that enables users to communicate and
spread information. User-generated content (UGC) has been used for a wide range of applications, including
user profiling. The Chinese Software Developer Network (CSDN) is one of the biggest platforms on which software
developers in China share technical information and engineering experience. Analyzing UGC on
CSDN can uncover users’ interests in the software development process, such as their past interests and
current focus, even if their user profiles are incomplete or missing. Apart from UGC, user behavior
data also contain useful information for user profiling, such as “following,” “replying,” and “sending private
messages,” through which a friendship network can be constructed to indicate user gender [1,2,3], age [4],
political polarity [5, 6] or profession [7].


In SMP CUP 2017 [8], the competition is structured around three tasks based on CSDN blogs①: (1)
keyword extraction from blogs, (2) user interest labeling and (3) user growth value prediction. Our team,
from the School of Information Science and Technology, University of International Relations, participated in
all three tasks of the User Profiling Technology Evaluation Campaign. This paper describes the framework of our
system, UIR-SIST, for the competition. We first extract keywords from a user’s blog, drawing on the blog itself,
blogs on the same topic, and other blogs published by the same user. Then a unified neural network model
with a self-attention mechanism is constructed for Task 2. The model is based on multi-scale convolutional
neural networks, with the aim of capturing both local and global information for user profiling. Finally, we
adopt a stacking model to predict user growth value. According to the SMP CUP 2017 metrics, our model
achieved scores of 0.563, 0.378 and 0.751 on the three tasks, respectively.


This paper is organized as follows. Section 2 introduces the User Profiling Technology Evaluation Campaign
in detail. Section 3 describes the framework of our system. Section 4 presents the evaluation results.
Finally, Section 5 concludes the paper.


2. EVALUATION OVERVIEW 
2.1 Data Set 


The data set used in SMP CUP 2017 is provided by CSDN, which is one of the largest information 
technology communities in China. The CSDN data set consists of all user generated content and the 
behavior data from 157,427 users during 2015, which can be further divided into three parts: 


1). 1,000,000 pieces of user blogs, involving blog ID, blog title and the corresponding content; 

2). Six types of user behavior data, including posting, browsing, commenting, voting up, voting down 
and adding favorites, and the corresponding date and time information; 

3). Relationship between users, which refers to the records of following and sending private messages. 


More details about the size and type of the CSDN data set are shown in Table 1. 


① https://github.com/LuJunru/SMPCUP2017_ELP

Table 1. Statistics of the evaluation data set.

Attribute      Type                   Content                                             Size       Format
Blogs          -                      Users’ blogs                                        1,000,000  D0802938/Title/Content
Behavior       Post                   Record of posting blogs                             1,000,000  U0024827/D0874760/2015-02-05 18:05:49.0
Behavior       Browse                 Record of browsing blogs                            3,536,444  U0143891/D0122539/20150919 09:48:07
Behavior       Comment                Record of commenting on blogs                       182,273    U0075737/D0383611/2015-10-30 11:18:32.0
Behavior       Vote up                Record of clicking a “like” button                  95,668     U0111639/D0627490/2015-02-21
Behavior       Vote down              Record of clicking a “dislike” button               9,326      U0019111/D0582423/2015-11-23
Behavior       Add favorites          Record of adding blogs to a user’s favorites list   104,723    U0014911/D0552113/2015-06-07 07:05:05
Relationships  Follow                 Record of following relationships                   667,037    U0124114/U0020107
Relationships  Send private messages  Record of sending private messages                  46,572     U0079109/U0055181/2015-12-24
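The record layouts in Table 1 mix several timestamp styles, so a loader has to try more than one format. The following is a minimal sketch of such a parser, assuming slash-separated fields exactly as printed in the Format column; the function names `parse_timestamp` and `parse_behavior` are hypothetical and not part of the released data tools.

```python
from datetime import datetime

# Timestamp layouts that appear in the Format column of Table 1.
_FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%Y%m%d %H:%M:%S", "%Y-%m-%d")


def parse_timestamp(text):
    """Try each known layout in turn and return a datetime."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {text!r}")


def parse_behavior(line):
    """Split 'user/blog/timestamp' into (user_id, blog_id, datetime)."""
    user_id, blog_id, stamp = line.strip().split("/", 2)
    return user_id, blog_id, parse_timestamp(stamp)


record = parse_behavior("U0024827/D0874760/2015-02-05 18:05:49.0")
```

Relationship records such as `U0124114/U0020107` carry no timestamp and would need a separate two-field parser.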


Table 2 illustrates an example from the given data set. 


Table 2. Sample of CSDN data set. 


Attribute              Data sample
User ID                U00296783
Blog ID                D00034623
Blog content           Title and content.
Keywords               Keyword1: TextRank; Keyword2: PageRank; Keyword3: Summary
Interest tags          Tag1: Big data; Tag2: Data mining; Tag3: Machine learning
Post                   U00296783/D00034623/20160408 12:35:49
Browse                 D09983742/20160410 08:30:40
Comment                D09983742/20160410 08:49:02
Vote up                D00234899/20160410 09:40:24
Vote down              D00098183/20160501 15:11:00
Send private messages  U00296783/U02748273/20160501 15:30:36
Add favorites          D00234899/20160410 09:40:44
Follow                 U00296783/U02666623/20161119 10:30:44
Growth value           0.0367


2.2 Tasks


Task 1: To extract three keywords from each document that can well represent the topic or the main idea
of the document.

Task 2: To generate three labels to describe a user’s interests, where the labels are chosen from a given
candidate set (42 in total).

Task 3: To predict each user’s growth value for the next six months according to his or her behavior over the
past year, including the texts, the relationships and the interactions with other users. The growth
value needs to be scaled into [0, 1], where 0 represents user drop-out.


2.3 Metrics 


To assess the system effectiveness in completing the above-mentioned tasks, the following evaluation 
metrics are designed for each individual task. 


$\mathrm{Score}_1$ is defined as the overlap ratio between the extracted keywords and the standard
answers, computed by Equation (1):

    \mathrm{Score}_1 = \frac{1}{N} \sum_{i=1}^{N} \frac{|K_i \cap K_i^*|}{|K_i|},    (1)

where N is the size of the validation set or the test set, $K_i$ is the set of keywords extracted from document i,
and $K_i^*$ is the set of standard keywords of document i. By definition, $|K_i| = 3$ and $|K_i^*| = 5$.


$\mathrm{Score}_2$ denotes the overlap ratio between the model’s tags and the standard answers, expressed by
Equation (2):

    \mathrm{Score}_2 = \frac{1}{N} \sum_{i=1}^{N} \frac{|T_i \cap T_i^*|}{|T_i|},    (2)

where $T_i$ is the automatically generated tag set of user i, and $T_i^*$ is the set of standard tags of user i. By
definition, $|T_i| = 3$ and $|T_i^*| = 3$.
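Equations (1) and (2) share the same form, so a single helper can score both tasks. The sketch below assumes that the denominator is the size of the predicted set (three items per document or user) and uses made-up toy data; `overlap_score` is a hypothetical name, not the official scoring script.

```python
def overlap_score(predicted, gold):
    """Mean of |P_i & G_i| / |P_i| over all items (form of Equations 1 and 2)."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    total = 0.0
    for pred, ref in zip(predicted, gold):
        pred_set, ref_set = set(pred), set(ref)
        total += len(pred_set & ref_set) / len(pred_set)
    return total / len(predicted)


# Toy example: three predicted keywords against five reference keywords.
predicted = [["textrank", "pagerank", "summary"], ["cnn", "lstm", "svm"]]
gold = [
    ["textrank", "pagerank", "graph", "ranking", "nlp"],
    ["cnn", "attention", "embedding", "svm", "gru"],
]
score = overlap_score(predicted, gold)
```

Each toy document has two of its three predictions in the reference set, so the mean overlap is two thirds.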


$\mathrm{Score}_3$ is calculated from the relative error between the predicted growth value and the real growth value of
users, expressed by Equation (3):

    \mathrm{Score}_3 = 1 - \frac{1}{N} \sum_{i=1}^{N}
        \begin{cases}
            0, & v_i = 0 \text{ and } v_i^* = 0 \\
            |v_i - v_i^*| / \max(v_i, v_i^*), & \text{otherwise}
        \end{cases}    (3)

where $v_i$ is the predicted growth value of user i, and $v_i^*$ is the real growth value of user i.

The overall score is computed by Equation (4):

    \mathrm{Score}_{\mathrm{all}} = \mathrm{Score}_1 + \mathrm{Score}_2 + \mathrm{Score}_3    (4)
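The growth-value metric can be sketched in a few lines. This example assumes that the score is one minus the mean relative error and that a pair of zeros counts as zero error (which also avoids dividing by zero); both points are a reading of Equation (3), and `growth_score` is a hypothetical helper rather than the organizers' script.

```python
def growth_score(predicted, actual):
    """One minus the mean relative error; a pair of zeros counts as no error."""
    errors = []
    for v, v_true in zip(predicted, actual):
        if v == 0 and v_true == 0:
            errors.append(0.0)
        else:
            errors.append(abs(v - v_true) / max(v, v_true))
    return 1.0 - sum(errors) / len(errors)


# Toy growth values in [0, 1].
score3 = growth_score([0.0, 0.5, 0.2], [0.0, 0.25, 0.4])

# Equation (4): the overall score is the plain sum of the three task scores.
# The values below are the test-set scores reported in this paper.
overall = 0.563 + 0.378 + 0.751
```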
3. SYSTEM OVERVIEW 


The overall architecture of UIR-SIST is shown in Figure 1. The UIR-SIST system comprises four
modules:

1). Preprocessing module: reads all blogs in the training and test sets and performs word segmentation,
part-of-speech (POS) tagging, named entity recognition and semantic role labeling;

2). Keyword extraction module: extracts three keywords that represent the main idea of a blog. The
candidate keyword set is generated from three aspects: the blog
content, other blogs published by the same user, and blogs on the same topic, as shown in the
green part;

3). User interests tagging module: constructs a neural network that combines user content
embedding with keyword and user tag embedding for user interests tagging, as shown in the red part;

4). User growth value prediction module: incorporates users’ interaction information and behavior
features into a supervised learning model for growth value prediction, as shown in the blue part.


Figure 1. System architecture. (Recoverable diagram labels: Keyword Candidate Generation, Keyword
Embedding, Blog Embedding, A Unified Neural Network, Feature Selection, Result 1.)

3.1 Keywords Extraction 


The objective of Task 1 is to extract three keywords from each blog that can represent the main idea of
the blog. In our view, the main idea can be captured from three aspects: the blog itself,
other blogs published by the same user, and blogs on the same topic. Based on this assumption, we
adopt three models, one per aspect, to generate a candidate keyword set:
tf-idf, TextRank and LDA, all of which have proved effective in related tasks. Three keywords are then
extracted from the candidate set by using different rules.


We first adopt the classic tf-idf term weighting scheme to reflect the content of the blog itself. We then
rank the keywords by tf-idf score and select the top 100 keywords to form the candidate
keyword set.

Regarding the blogs on the same topic, we adopt the TextRank approach [9] to cluster these blogs together.
All the keywords are weighted during this process, and we finally select the top 300 keywords.
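To make the TextRank weighting concrete, the sketch below ranks words with PageRank over an undirected co-occurrence graph. It is an illustration only: the window size, damping factor and iteration count are assumed defaults that the text does not specify, and the token list is invented.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it inside the window.
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * incoming
        scores = updated
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


tokens = ["python", "code", "java", "code", "rust", "code", "golang"]
top_words = textrank_keywords(tokens)
```

In this toy sequence the word "code" is adjacent to every other word, so it receives the highest score.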


Moreover, we utilize topic information to extract keywords. Since 42 categories of tags are given in
Task 2, we assume that these 42 topics are drawn from all the blogs. We therefore use the Latent Dirichlet
Allocation (LDA) model [10] to extract the top 100 keywords for each category from the 1,000,000 blogs, and thus
obtain the distribution of these 4,200 topic keywords across categories.


In summary, we consider three aspects in order to reflect the blog content and obtain three independent
candidate keyword sets, extracted through the tf-idf, TextRank and LDA models. We then
keep only the intersection of these sets. The training set of Task 1 provides about 5,000 keywords,
collected after extraction and deduplication.
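The intersection step can be sketched as plain set operations. The selection rule below (rank the shared words by tf-idf score and keep the top three) is an assumption made for illustration, since the text does not spell out the rules; `select_keywords` and the sample scores are hypothetical.

```python
def select_keywords(tfidf_scores, textrank_words, lda_words, top_k=3):
    """Keep words found by all three models, ranked by tf-idf score (assumed rule)."""
    shared = set(tfidf_scores) & set(textrank_words) & set(lda_words)
    ranked = sorted(shared, key=lambda w: (-tfidf_scores[w], w))
    return ranked[:top_k]


# Invented candidate sets for a single blog.
tfidf_scores = {"textrank": 0.9, "pagerank": 0.7, "summary": 0.4, "blog": 0.8}
textrank_words = {"textrank", "pagerank", "summary", "graph"}
lda_words = {"textrank", "pagerank", "summary", "blog"}

keywords = select_keywords(tfidf_scores, textrank_words, lda_words)
```

Here "blog" has a high tf-idf score but is dropped because the TextRank candidates do not contain it.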


A drawback of the classic tf-idf model is that it simply presupposes that the rarer a word is in the corpus,
the more important it is and the greater its contribution to the main idea of the text. However, for
a group of articles that mainly use the same keywords and describe similar concepts,
this assumption produces many errors. This is also the reason why we use tf-idf on the short individual blog,
while we use the TextRank model on the long collection of blogs published by the same user.
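The drawback described above is easy to reproduce with a small example. The sketch below implements textbook tf-idf (term frequency times the logarithm of inverse document frequency) on three invented token lists; it is not the exact weighting variant used in the system.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {word: tf-idf weight} dictionary per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [
    ["python", "list", "sort"],
    ["python", "dict", "sort"],
    ["python", "thread", "lock"],
]
weights = tfidf(docs)
```

Because "python" occurs in every document, its inverse document frequency is log(1) = 0 and it receives zero weight, even though it is arguably the shared topic of all three blogs.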


In addition, to enhance the cross-topic analysis ability of tf-idf, we borrow an idea from the 2016 Big Data &
Computing Intelligence Contest sponsored by the China Computer Federation (CCF)② and
adjust the result of the traditional tf-idf calculation, obtaining S-TFIDF(w) by using
Equation (5):


S — TFIDF (w) = TFIDF (w) * (= - a (5) 


where $C_w$ is the frequency of word w appearing in the 42 categories.


3.2 User Interests Tagging 


The objective of this task is to tag a user’s interests with three labels from the 42 given ones. We model this
task with neural networks; the model structure is shown in Figure 2. Each blog is represented by a blog
embedding [11] obtained through convolution and max-pooling layers. We then obtain a user’s content embedding
as the weighted sum of all of his or her blog embeddings, where the weight of each blog embedding is
computed by a self-attention mechanism. The content embedding and keyword embedding are concatenated into the
user embedding, which is finally fed to the output layer.


② https://github.com/coderSkyChen/2016CCF_BDCI_Sougou

Figure 2. Framework of CNNs model based on weighted-blog-embeddings in Task 2.


In our system, a convolutional neural network (CNN) model is constructed for blog representation instead
of a recurrent neural network (RNN), since it captures more global information for indicating user
interests and is also more time-efficient. Multi-scale
convolutional neural networks [12] have been widely adopted owing to their outstanding results in computer
vision [13], and TextCNN models, which stack word embeddings vertically, have also proved highly
effective for natural language processing (NLP) tasks [14].


In our CNN model, we treat a blog as a sequence of words $x = [x_1, x_2, \ldots, x_l]$, where each word is
represented by its word embedding vector, so that the blog is described by a feature matrix S. The narrow convolution
layer applied to this matrix is based on a kernel $W \in \mathbb{R}^{kd}$ of width k (with d the word embedding
dimension), a nonlinear function f and a bias variable b, as described by Equation (6):

    h_i = f(W \cdot x_{i:i+k-1} + b),    (6)

where $x_{i:j}$ denotes the concatenation of the word vectors from position i to
position j. In this task, we use several kernel sizes to obtain multiple local contextual feature maps in the
convolution layer, and then apply max-over-time pooling [15] to extract the most important
features.

The output is a low-dimensional dense representation of each single blog, and each of a user’s related blogs
is encoded in this way. We can simply average the blog vectors to obtain the content
embedding c(u) for an individual user:

    c(u) = \frac{1}{T} \sum_{i=1}^{T} h_i,    (7)

where T is the total number of a user’s related blogs and $h_i$ here denotes the vector of the i-th blog.
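The convolution, pooling and averaging steps can be traced with a small pure-Python sketch. It assumes tanh as the nonlinearity f and one filter per kernel size, and it uses tiny invented vectors; a real model would learn the kernels and use a tensor library.

```python
import math


def conv_max_pool(word_vectors, kernel, bias=0.0):
    """Apply one narrow convolution filter, then max-over-time pooling.

    word_vectors: list of word vectors, each of length d.
    kernel: flat list of k * d weights, i.e. one filter of width k.
    """
    dim = len(word_vectors[0])
    width = len(kernel) // dim
    if len(word_vectors) < width:
        raise ValueError("sequence is shorter than the kernel width")
    feature_map = []
    for i in range(len(word_vectors) - width + 1):
        # Concatenate the word vectors covered by the current window.
        window = [x for vec in word_vectors[i:i + width] for x in vec]
        activation = sum(w * x for w, x in zip(kernel, window)) + bias
        feature_map.append(math.tanh(activation))  # f = tanh (assumed)
    return max(feature_map)


def blog_vector(word_vectors, kernels):
    """One pooled feature per kernel gives the blog representation."""
    return [conv_max_pool(word_vectors, kernel) for kernel in kernels]


def content_embedding(blog_vectors):
    """Plain average of a user's blog vectors."""
    count = len(blog_vectors)
    return [sum(column) / count for column in zip(*blog_vectors)]


words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # three words, d = 2
kernels = [[1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]      # widths 1 and 2
h = blog_vector(words, kernels)
c_u = content_embedding([h, [0.0, 0.0]])
```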


However, the source of a blog reflects the extent of a user’s interest in its topic. For example,
a blog posted by a user may be an article written by the user, a repost from another user, or
content shared from another platform. It is natural to pay attention to these blogs in varying
degrees when we infer the user’s interests. Thus, a self-attention mechanism is introduced, which
automatically assigns a different weight to each of the user’s blogs after training. The user context
representation is given by the weighted sum of all blog vectors:

    \alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{T} \exp(e_j)},    (8)

    e_i = v^{\top} \tanh(W s_i + U h_i),    (9)

    c(u) = \sum_{i=1}^{T} \alpha_i h_i,    (10)

where $\alpha_i$ is the weight of the i-th blog, $s_i$ is the one-hot source representation vector of the blog,
$v \in \mathbb{R}^{n'}$, $W \in \mathbb{R}^{n' \times m}$, $U \in \mathbb{R}^{n' \times n}$, $s_i \in \mathbb{R}^{m}$,
$h_i \in \mathbb{R}^{n}$, and m is the number of all source platforms.
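The attention weighting of Equations (8) to (10) can be followed numerically with the sketch below. The parameter values are invented for illustration (in the model they are learned), and the helper names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def attention_pool(blog_vectors, source_vectors, v, W, U):
    """Softmax over e_i = v^T tanh(W s_i + U h_i), then a weighted sum of h_i."""
    energies = []
    for h, s in zip(blog_vectors, source_vectors):
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W, s), matvec(U, h))]
        energies.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    peak = max(energies)  # subtract the maximum for numerical stability
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(blog_vectors[0])
    pooled = [sum(a * h[j] for a, h in zip(alphas, blog_vectors)) for j in range(dim)]
    return alphas, pooled


# Two blogs (n = 2) from two source types (m = 2), hidden size n' = 1.
blog_vectors = [[1.0, 0.0], [0.0, 1.0]]
source_vectors = [[1.0, 0.0], [0.0, 1.0]]   # one-hot source indicators
v = [1.0]
W = [[1.0, -1.0]]                            # favors the first source type
U = [[0.0, 0.0]]
alphas, c_u = attention_pool(blog_vectors, source_vectors, v, W, U)
```

With these toy parameters the first source type receives the larger weight, so the pooled vector leans toward the first blog.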


Once a user’s context representation has been computed, it is concatenated with the keyword matrix of all the
blogs’ keywords extracted by our model in Task 1. The result is the final feature vector produced by the
feature engineering described above. Afterwards, an ANN layer is trained on the user embeddings from the training set and predicts
the probability distribution of users’ interests over the 42 tags in the validation and test sets according to their
embeddings.


3.3 User Growth Value Prediction 


According to the description of Task 3, the growth value can be regarded as the degree of activeness.
Therefore, our basic idea is to incorporate a user’s interaction information and his or her behavior statistical
features into a supervised learning model. The procedure of Task 3 is shown in Figure 3.

Figure 3. Framework of the stacking model in Task 3. (Recoverable diagram labels: Behavior Statistics,
Feature Selection, Passive Aggressive, Final Result.)


On the whole, we use a stacking framework [16] to improve the accuracy of the final prediction. After the
basic behavior statistics analysis, the selected original features are fed into the
stacking model, which is divided into two layers: the base layer and the stacking layer.
In the base layer, we choose the Passive Aggressive Regressor [17] and the Gradient Boosting Regressor [18, 19]
as the base regressors because of their strong performance. In the stacking layer, we use a
support vector machine (SVM) model, specifically NuSVR, which can control its error rate.
The output of the stacking layer is the final prediction of user growth value.


3.3.1 Original Feature Selection 


Figure 4 illustrates an example of the daily statistics of user behaviors, including posting, browsing, 
commenting, voting up, voting down, adding favorites, following, and sending private messages. To predict 
the user growth value, it is noted that the dynamic changes of behaviors along the time line are more useful. 
To avoid the sparse data problem, we adopt the monthly statistics of user behaviors rather than daily 
statistics. 
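Switching from daily to monthly statistics amounts to bucketing event timestamps by calendar month. The sketch below shows one way to do this for a single user and a single behavior type; the function name and the sample dates are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(timestamps, year=2015):
    """Count events per calendar month of `year`; returns a 12-element list."""
    counts = Counter(ts.month for ts in timestamps if ts.year == year)
    return [counts.get(month, 0) for month in range(1, 13)]


# Invented browsing events for one user.
events = [
    datetime(2015, 1, 3, 9, 30),
    datetime(2015, 1, 20, 14, 5),
    datetime(2015, 3, 5, 8, 0),
    datetime(2016, 1, 1, 0, 0),   # outside the observation year, ignored
]
browse_by_month = monthly_counts(events)
```

The twelve monthly counts are far denser than 365 daily counts, which is the motivation given above for avoiding sparse inputs.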


Figure 4. Example of daily statistics of user behaviors. Note: “Add” refers to “add favorites”, and “send” refers to
“send private messages”.

We then use correlation analysis to exclude the “vote down” behavior because of its negative contribution
to model prediction. After that, through feature selection, we use the average, the log transform and the growth
rate of the original data to obtain features for the stacking model:


    \mathrm{LOG}(d) = \log(d + 1),    (11)

    \mathrm{GR}(d_t) = \frac{d_{t+1} - d_t}{d_t},    (12)

where LOG(d) is the log-adjusted value of data d, and $\mathrm{GR}(d_t)$ is the
growth rate from data $d_t$ in month t to data $d_{t+1}$ in month t+1.
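The three derived features (average, log transform and growth rate) can be computed from a monthly series as sketched below. The growth rate is taken here as the month-over-month change divided by the earlier month, and returning 0.0 when the earlier month is zero is an assumption added to avoid division by zero; neither detail is confirmed by the text.

```python
import math


def log_feature(d):
    """Log transform with a +1 shift so that zero counts map to zero."""
    return math.log(d + 1)


def growth_rate(d_t, d_next):
    """Relative change from month t to month t+1 (0.0 if d_t is 0, assumed)."""
    if d_t == 0:
        return 0.0
    return (d_next - d_t) / d_t


def derive_features(monthly):
    """Average, per-month log values and month-over-month growth rates."""
    return {
        "average": sum(monthly) / len(monthly),
        "log": [log_feature(d) for d in monthly],
        "growth_rate": [growth_rate(a, b) for a, b in zip(monthly, monthly[1:])],
    }


features = derive_features([4, 6, 3, 0])
```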


3.3.2 PAR/GDR-NuSVR-Stacking Model (PGNS) 


Once we have obtained the monthly statistics and derived features described above, their combination
is fed as input into the Passive Aggressive Regressor and the Gradient Boosting Regressor
independently. Averaging the predictions of these two base models creates a new feature, which is
input into the stacking model NuSVR. Because of the inherent randomness of the base models, we adopt a
self-check mechanism based on 10-fold cross validation.


If the trained model obtains a score higher than the threshold S* under the given scoring rules, we enter
the corresponding features of the validation set or test set into the model to obtain a prediction, which is
saved into a candidate set. Conversely, if the trained model obtains a 10-fold cross validation score
lower than S*, the model is discarded and the program returns to the training session shown
in the dotted box for a new round of training.


To reduce the errors of a single round of training, we run at least R* rounds of training and add
all predictions that obtain scores higher than S* to the candidate set. In our experience, the ratio
of the size of the candidate set to R* is about 0.45. When all rounds of training are completed, the predictions
in the candidate set are averaged to generate the final results.
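The self-check loop can be summarized as the control flow below. It is a sketch under assumptions: `train_round` is a hypothetical callable standing in for one full training round (base regressors plus NuSVR) that returns a cross-validation score and a list of predictions, and the loop runs exactly `rounds` rounds, whereas the text allows more.

```python
from statistics import mean


def self_check_ensemble(train_round, threshold, rounds):
    """Average the predictions of all rounds whose CV score exceeds the threshold.

    train_round: callable returning (cv_score, predictions) for one round.
    Returns the averaged predictions and the number of rounds kept.
    """
    candidates = []
    for _ in range(rounds):
        cv_score, predictions = train_round()
        if cv_score > threshold:          # keep only rounds that pass S*
            candidates.append(predictions)
    if not candidates:
        raise RuntimeError("no round passed the threshold")
    averaged = [mean(column) for column in zip(*candidates)]
    return averaged, len(candidates)


# Three canned rounds in place of real training, for illustration only.
canned = iter([
    (0.70, [0.2, 0.4]),   # below the threshold, discarded
    (0.78, [0.1, 0.3]),
    (0.80, [0.3, 0.5]),
])
final, kept = self_check_ensemble(lambda: next(canned), threshold=0.75, rounds=3)
```

Two of the three canned rounds pass the threshold, and the final prediction is the element-wise mean of their outputs.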


4. EVALUATION 


In our model, we first adopt the Jieba③ toolkit for Chinese word segmentation, and then train
300-dimensional word embeddings [11].


Table 3 shows the comparison results of our proposed approach for Task 1. It is observed that the best 
results are achieved when data of all the three aspects are used for capturing the main ideas of blogs. 


③ https://github.com/fxsjy/jieba

Table 3. Comparison on Task 1 with different aspects.

Approach          Results
BI: Blog itself   0.505
ST: Same topic    0.371
SU: Same user     0.436
BI+ST+SU          0.563


Besides, we also test performance of our combined neural network with different embedding inputs. 
Note that to obtain the results of individual embedding, we train a new CNN model for blog embedding, 
and compute the similarity between blog content and keywords in the embedding representation. The 
experimental results are summarized in Table 4. It is observed that the embedding of blog content proves 
more effective than that of keywords, while they together achieve the best run. 


Table 4. Comparison of different aspects on Task 2. 


Approach                   Results
Blog embedding             0.301
Keywords embedding         0.245
Blog + keywords embedding  0.378


Table 5 displays the overall performance of our system’s best run on each individual task, which achieved 
the sixth place in the competition. 


Table 5. Performance of UIR-SIST system in SMP CUP 2017. 


Data set                Task 1  Task 2  Task 3  Total
Training set (10 Fold)  0.610   0.390   0.765   1.765
Validation set          0.560   0.390   0.730   1.680
Test set                0.563   0.378   0.751   1.692


5. CONCLUSIONS AND FUTURE WORK 


In this paper, we present our system built for the User Profiling Technology Evaluation Campaign of SMP
CUP 2017. To complete Task 1, we extract keywords from three aspects of a user’s blogs:
the blog itself, blogs on the same topic, and other blogs published by the same user. A
unified neural network model with a self-attention mechanism is then constructed for Task 2. The model is based
on multi-scale convolutional neural networks, with the aim of capturing both local and global information
for user profiles. Finally, we adopt a stacking model to predict user growth value. According to the SMP
CUP 2017 metrics, our model runs achieved final scores of 0.563, 0.378 and 0.751 on the three tasks,
respectively.

Future work includes analysis of the relationships between users and blogs. We only use the users’
behavior in Task 2 in the current system, and the time when blogs are published is ignored. We plan to
include network embedding in our model. Moreover, we will collect more blogs with real-time information
and attempt to incorporate the time information into our weighting scheme in those tasks.